First, we import required libraries and data. Data is also processed to a format which feels most suited to forecast.
library(vars)
## Le chargement a nécessité le package : MASS
## Le chargement a nécessité le package : strucchange
## Le chargement a nécessité le package : zoo
##
## Attachement du package : 'zoo'
## Les objets suivants sont masqués depuis 'package:base':
##
## as.Date, as.Date.numeric
## Le chargement a nécessité le package : sandwich
## Le chargement a nécessité le package : urca
## Le chargement a nécessité le package : lmtest
library(xlsx)
library(urca)
library(dplyr)
##
## Attachement du package : 'dplyr'
## L'objet suivant est masqué depuis 'package:MASS':
##
## select
## Les objets suivants sont masqués depuis 'package:stats':
##
## filter, lag
## Les objets suivants sont masqués depuis 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
# Read data file
df <- read.xlsx("/Users/edouardvilain/Desktop/3A - DSBA/T1 - ESSEC&Centrale/Forecasting & Predictive Analytics/Group Project/data.xlsx", sheetIndex=1)
mat <- as.matrix(df) # Transform into matrix and transpose
t_mat <- t(mat)
t_df <- as.data.frame(t_mat)
ts_df <- t_df[c(-1,-2,-3,-4,-5),c(-19)] # Remove unnecessary rows and column
ts_df <- as.data.frame(sapply(ts_df,as.numeric))
colnames(ts_df) <- c("Hobbies_CA_1","Household_1_CA_1","Household_2_CA_1","Foods_1_CA_1","Foods_2_CA_1","Foods_3_CA_1",
"Hobbies_CA_2","Household_1_CA_2","Household_2_CA_2","Foods_1_CA_2","Foods_2_CA_2","Foods_3_CA_2",
"Hobbies_CA_3","Household_1_CA_3","Household_2_CA_3","Foods_1_CA_3","Foods_2_CA_3","Foods_3_CA_3") # Set column names
We then split data into train and test set. Inspired by Exercise 1, we decide to choose the 80% first (in time) data rows as train data and the final 20% as test data.
split = 0.8
train_indices <- seq_len(length.out = floor(x = split * nrow(x = ts_df)))
train.df <- ts_df[train_indices,]
test.df <- ts_df[-train_indices,]
nrow(train.df)
## [1] 1575
nrow(test.df)
## [1] 394
Finally, we set the forecasting horizon to 1 for the entire Exercise and implement a function that will enable us to compute our models’ RMSSE scores.
# Set forecasting horizon to 1
forecast_h <- 1
# We define a function that computes RMSSE
rmsse <- function(train, pred, truth) {
if (class(pred) == "forecast") {
pred = pred$mean
}
rmse = sum((pred - truth)**2) / length(pred)
rmse = rmse**0.5
train.length = length(train)
denom = sum((train[1: train.length-1] - train[2: train.length])**2) / (train.length - 1)
rmsse = rmse / denom**0.5
rmsse
}
In this section, we aim to fit VAR models to store data. We thus consider 3 multivariate time series, each corresponding to a store and constituted by the 6 product time series. We compute column indexes of each store in the main dataframe:
store_1 <- c(1,2,3,4,5,6)
store_2 <- c(7,8,9,10,11,12)
store_3 <- c(13,14,15,16,17,18)
# Train data for each store
s1_df <- train.df[,store_1]
s2_df <- train.df[,store_2]
s3_df <- train.df[,store_3]
We can now select the lags of our VAR models using different information criteria (AIC, HQ, SC and FPE):
# Select lag length for each store (maximum lag is selected as the length of a month)
VARselect(s1_df, lag.max=31) # Information criteria equally suggest p=7 or p=15, we choose p=7 for simplicity reasons
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 15 7 7 15
##
## $criteria
## 1 2 3 4 5
## AIC(n) 4.608273e+01 4.562709e+01 4.551392e+01 4.542593e+01 4.526255e+01
## HQ(n) 4.613678e+01 4.572749e+01 4.566065e+01 4.561899e+01 4.550195e+01
## SC(n) 4.622804e+01 4.589697e+01 4.590835e+01 4.594492e+01 4.590610e+01
## FPE(n) 1.031512e+20 6.540283e+19 5.840481e+19 5.348569e+19 4.542449e+19
## 6 7 8 9 10
## AIC(n) 4.481313e+01 4.448648e+01 4.447395e+01 4.444463e+01 4.445780e+01
## HQ(n) 4.509886e+01 4.481855e+01 4.485235e+01 4.486937e+01 4.492887e+01
## SC(n) 4.558123e+01 4.537914e+01 4.549117e+01 4.558641e+01 4.572413e+01
## FPE(n) 2.898131e+19 2.090596e+19 2.064646e+19 2.005106e+19 2.031807e+19
## 11 12 13 14 15
## AIC(n) 4.447296e+01 4.447425e+01 4.442314e+01 4.432741e+01 4.430972e+01
## HQ(n) 4.499037e+01 4.503799e+01 4.503322e+01 4.498383e+01 4.501247e+01
## SC(n) 4.586385e+01 4.598970e+01 4.606314e+01 4.609198e+01 4.619884e+01
## FPE(n) 2.063026e+19 2.065879e+19 1.963163e+19 1.784192e+19 1.753169e+19
## 16 17 18 19 20
## AIC(n) 4.431783e+01 4.433782e+01 4.435731e+01 4.437600e+01 4.437991e+01
## HQ(n) 4.506692e+01 4.513324e+01 4.519907e+01 4.526409e+01 4.531434e+01
## SC(n) 4.633151e+01 4.647605e+01 4.662011e+01 4.676335e+01 4.689182e+01
## FPE(n) 1.767754e+19 1.803793e+19 1.839713e+19 1.874868e+19 1.882741e+19
## 21 22 23 24 25
## AIC(n) 4.435954e+01 4.437163e+01 4.437776e+01 4.438175e+01 4.439823e+01
## HQ(n) 4.534030e+01 4.539872e+01 4.545119e+01 4.550152e+01 4.556433e+01
## SC(n) 4.699601e+01 4.713265e+01 4.726334e+01 4.739189e+01 4.753293e+01
## FPE(n) 1.845335e+19 1.868394e+19 1.880567e+19 1.888849e+19 1.921058e+19
## 26 27 28 29 30
## AIC(n) 4.440833e+01 4.441632e+01 4.438015e+01 4.438536e+01 4.440586e+01
## HQ(n) 4.562076e+01 4.567509e+01 4.568525e+01 4.573680e+01 4.580364e+01
## SC(n) 4.766758e+01 4.780013e+01 4.788852e+01 4.801829e+01 4.816335e+01
## FPE(n) 1.941460e+19 1.958040e+19 1.889508e+19 1.900492e+19 1.941080e+19
## 31
## AIC(n) 4.442448e+01
## HQ(n) 4.586859e+01
## SC(n) 4.830652e+01
## FPE(n) 1.978888e+19
VARselect(s2_df, lag.max=31) # Information criteria equally suggest p=7 or p=14, we choose p=7 for simplicity reasons
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 14 7 7 14
##
## $criteria
## 1 2 3 4 5
## AIC(n) 4.214089e+01 4.162175e+01 4.145712e+01 4.131724e+01 4.122532e+01
## HQ(n) 4.219494e+01 4.172214e+01 4.160384e+01 4.151030e+01 4.146472e+01
## SC(n) 4.228620e+01 4.189162e+01 4.185155e+01 4.183623e+01 4.186887e+01
## FPE(n) 2.002415e+18 1.191508e+18 1.010649e+18 8.787335e+17 8.015717e+17
## 6 7 8 9 10
## AIC(n) 4.080979e+01 4.026782e+01 4.027386e+01 4.026563e+01 4.028346e+01
## HQ(n) 4.109552e+01 4.059989e+01 4.065227e+01 4.069036e+01 4.075453e+01
## SC(n) 4.157790e+01 4.116048e+01 4.129108e+01 4.140740e+01 4.154979e+01
## FPE(n) 5.290429e+17 3.077022e+17 3.095792e+17 3.070558e+17 3.126006e+17
## 11 12 13 14 15
## AIC(n) 4.030392e+01 4.030702e+01 4.028832e+01 4.017590e+01 4.020551e+01
## HQ(n) 4.082132e+01 4.087077e+01 4.089840e+01 4.083231e+01 4.090826e+01
## SC(n) 4.169481e+01 4.182247e+01 4.192832e+01 4.194046e+01 4.209463e+01
## FPE(n) 3.190883e+17 3.201118e+17 3.142152e+17 2.808408e+17 2.893261e+17
## 16 17 18 19 20
## AIC(n) 4.023293e+01 4.026071e+01 4.028461e+01 4.030562e+01 4.032188e+01
## HQ(n) 4.098201e+01 4.105613e+01 4.112636e+01 4.119371e+01 4.125631e+01
## SC(n) 4.224661e+01 4.239895e+01 4.254740e+01 4.269297e+01 4.283379e+01
## FPE(n) 2.974197e+17 3.058593e+17 3.133261e+17 3.200581e+17 3.253944e+17
## 21 22 23 24 25
## AIC(n) 4.030066e+01 4.031848e+01 4.033270e+01 4.034720e+01 4.037814e+01
## HQ(n) 4.128142e+01 4.134558e+01 4.140613e+01 4.146697e+01 4.154424e+01
## SC(n) 4.293713e+01 4.307951e+01 4.321828e+01 4.335734e+01 4.351284e+01
## FPE(n) 3.186590e+17 3.244961e+17 3.292623e+17 3.342059e+17 3.448561e+17
## 26 27 28 29 30
## AIC(n) 4.038429e+01 4.041424e+01 4.038399e+01 4.040261e+01 4.041830e+01
## HQ(n) 4.159672e+01 4.167301e+01 4.168909e+01 4.175405e+01 4.181607e+01
## SC(n) 4.364354e+01 4.379805e+01 4.389236e+01 4.403554e+01 4.417578e+01
## FPE(n) 3.471460e+17 3.578816e+17 3.474061e+17 3.541441e+17 3.599699e+17
## 31
## AIC(n) 4.044287e+01
## HQ(n) 4.188698e+01
## SC(n) 4.432492e+01
## FPE(n) 3.691738e+17
VARselect(s3_df, lag.max=31) # Information criteria equally suggest p=7 or p=14, we choose p=7 for simplicity reasons
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 14 7 7 14
##
## $criteria
## 1 2 3 4 5
## AIC(n) 4.921612e+01 4.888039e+01 4.877754e+01 4.867650e+01 4.852919e+01
## HQ(n) 4.927018e+01 4.898078e+01 4.892426e+01 4.886957e+01 4.876859e+01
## SC(n) 4.936144e+01 4.915026e+01 4.917197e+01 4.919549e+01 4.917274e+01
## FPE(n) 2.367507e+21 1.692325e+21 1.526925e+21 1.380207e+21 1.191173e+21
## 6 7 8 9 10
## AIC(n) 4.820137e+01 4.798717e+01 4.796456e+01 4.795829e+01 4.794959e+01
## HQ(n) 4.848710e+01 4.831924e+01 4.834296e+01 4.838303e+01 4.842066e+01
## SC(n) 4.896947e+01 4.887984e+01 4.898177e+01 4.910007e+01 4.921592e+01
## FPE(n) 8.582478e+20 6.927915e+20 6.773254e+20 6.731314e+20 6.673420e+20
## 11 12 13 14 15
## AIC(n) 4.796434e+01 4.797232e+01 4.793098e+01 4.784625e+01 4.785669e+01
## HQ(n) 4.848175e+01 4.853606e+01 4.854105e+01 4.850267e+01 4.855944e+01
## SC(n) 4.935524e+01 4.948777e+01 4.957098e+01 4.961082e+01 4.974581e+01
## FPE(n) 6.773175e+20 6.828064e+20 6.552260e+20 6.020802e+20 6.084899e+20
## 16 17 18 19 20
## AIC(n) 4.787187e+01 4.788878e+01 4.790679e+01 4.792483e+01 4.793009e+01
## HQ(n) 4.862095e+01 4.868420e+01 4.874855e+01 4.881291e+01 4.886451e+01
## SC(n) 4.988555e+01 5.002702e+01 5.016959e+01 5.031218e+01 5.044200e+01
## FPE(n) 6.179011e+20 6.285675e+20 6.401320e+20 6.519404e+20 6.555611e+20
## 21 22 23 24 25
## AIC(n) 4.791068e+01 4.792506e+01 4.793461e+01 4.794968e+01 4.795980e+01
## HQ(n) 4.889144e+01 4.895216e+01 4.900804e+01 4.906945e+01 4.912590e+01
## SC(n) 5.054715e+01 5.068609e+01 5.082019e+01 5.095982e+01 5.109449e+01
## FPE(n) 6.431546e+20 6.526882e+20 6.591889e+20 6.694662e+20 6.765652e+20
## 26 27 28 29 30
## AIC(n) 4.796230e+01 4.796745e+01 4.794752e+01 4.795279e+01 4.797840e+01
## HQ(n) 4.917473e+01 4.922622e+01 4.925262e+01 4.930422e+01 4.937618e+01
## SC(n) 5.122155e+01 5.135126e+01 5.145589e+01 5.158571e+01 5.173589e+01
## FPE(n) 6.785768e+20 6.824282e+20 6.693255e+20 6.732544e+20 6.911585e+20
## 31
## AIC(n) 4.800057e+01
## HQ(n) 4.944468e+01
## SC(n) 5.188261e+01
## FPE(n) 7.071268e+20
Note that this analysis is reassuring as all selection criteria return the multiple of a week’s length (approximately). We thus select p=7 days as our lag and justify it by its interpretability. We can now compute our VAR models and make predictions on test data. Recall that we consider VAR models with forecasting horizons of 1. Therefore, to make predictions on all test data, we must fit a VAR model for each test data on all past data:
# Predict with horizon 1 on each test day, to do so, we fit a VAR model on all past points and use prediction
# with horizon
s1_pred <- data.frame()
s2_pred <- data.frame()
s3_pred <- data.frame()
for (i in 1:394){
s1_df <- bind_rows(train.df[,store_1], test.df[c(1:i-1),store_1])
s2_df <- bind_rows(train.df[,store_2], test.df[c(1:i-1),store_2])
s3_df <- bind_rows(train.df[,store_3], test.df[c(1:i-1),store_3])
var.s1 <- VAR(s1_df, p=7)
var.s2 <- VAR(s2_df, p=7)
var.s3 <- VAR(s3_df, p=7)
var.s1.pred <- predict(var.s1, n.ahead=1)
var.s2.pred <- predict(var.s2, n.ahead=1)
var.s3.pred <- predict(var.s3, n.ahead=1)
s1_pred <- bind_rows(s1_pred,data.frame(t(var.s1.pred$model$y[nrow(var.s1.pred$model$y),])))
s2_pred <- bind_rows(s2_pred,data.frame(t(var.s2.pred$model$y[nrow(var.s2.pred$model$y),])))
s3_pred <- bind_rows(s3_pred,data.frame(t(var.s3.pred$model$y[nrow(var.s3.pred$model$y),])))
}
We can then recompute our store dataframes with ground truth values and prediction values:
# Get final ground truth and prediction dataframes on test data
s1_true <- ts_df[,store_1]
s1_pred_full <- bind_rows(train.df[,store_1],s1_pred)
s2_true <- ts_df[,store_2]
s2_pred_full <- bind_rows(train.df[,store_2],s2_pred)
s3_true <- ts_df[,store_3]
s3_pred_full <- bind_rows(train.df[,store_3],s3_pred)
For each store, we compute ground truth and prediction curves for each product and observe the satisfying precision of our models:
# Plot prediction and ground truth curves
time <- 1:394
# Store 1 plot
s1_p1 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Hobbies_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Hobbies_CA_1)) + labs(y="Hobbies")
s1_p2 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Household_1_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Household_1_CA_1)) + labs(y="Household 1")
s1_p3 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Household_2_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Household_2_CA_1)) + labs(y="Household 2")
s1_p4 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Foods_1_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Foods_1_CA_1)) + labs(y="Foods 1")
s1_p5 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Foods_2_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Foods_2_CA_1)) + labs(y="Foods 2")
s1_p6 <- ggplot() + geom_line(data=s1_pred, aes(x=time,y=Foods_3_CA_1), colour="red") +
geom_line(data=test.df[,store_1], aes(x=time,y=Foods_3_CA_1)) + labs(y="Foods 3")
s1_fig <- ggarrange(s1_p1,s1_p2,s1_p3,s1_p4,s1_p5,s1_p6,
ncol=2, nrow=3)
annotate_figure(s1_fig,
top = text_grob("Store 1 Forecast", face = "bold", size = 14)
)
# Store 2 plot
s2_p1 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Hobbies_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Hobbies_CA_2)) + labs(y="Hobbies")
s2_p2 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Household_1_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Household_1_CA_2)) + labs(y="Household 1")
s2_p3 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Household_2_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Household_2_CA_2)) + labs(y="Household 2")
s2_p4 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Foods_1_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Foods_1_CA_2)) + labs(y="Foods 1")
s2_p5 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Foods_2_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Foods_2_CA_2)) + labs(y="Foods 2")
s2_p6 <- ggplot() + geom_line(data=s2_pred, aes(x=time,y=Foods_3_CA_2), colour="red") +
geom_line(data=test.df[,store_2], aes(x=time,y=Foods_3_CA_2)) + labs(y="Foods 3")
s2_fig <- ggarrange(s2_p1,s2_p2,s2_p3,s2_p4,s2_p5,s2_p6,
ncol=2, nrow=3)
annotate_figure(s2_fig,
top = text_grob("Store 2 Forecast", face = "bold", size = 14)
)
# Store 3 plot
s3_p1 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Hobbies_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Hobbies_CA_3)) + labs(y="Hobbies")
s3_p2 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Household_1_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Household_1_CA_3)) + labs(y="Household 1")
s3_p3 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Household_2_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Household_2_CA_3)) + labs(y="Household 2")
s3_p4 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Foods_1_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Foods_1_CA_3)) + labs(y="Foods 1")
s3_p5 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Foods_2_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Foods_2_CA_3)) + labs(y="Foods 2")
s3_p6 <- ggplot() + geom_line(data=s3_pred, aes(x=time,y=Foods_3_CA_3), colour="red") +
geom_line(data=test.df[,store_3], aes(x=time,y=Foods_3_CA_3)) + labs(y="Foods 3")
s3_fig <- ggarrange(s3_p1,s3_p2,s3_p3,s3_p4,s3_p5,s3_p6,
ncol=2, nrow=3)
annotate_figure(s3_fig,
top = text_grob("Store 3 Forecast", face = "bold", size = 14)
)
Finally, we evaluate our models using RMSE and RMSSE scores. These scores confirm the satisfying results observed above as they compare well to the models computed in Exercise 1.
# Compute RMSE for each prediction
print("Store 1 RMSE:")
## [1] "Store 1 RMSE:"
print(sqrt(mean(as.matrix((s1_true[c(1576:1969),] - s1_pred_full[c(1576:1969),])^2))))
## [1] 164.7499
print("Store 2 RMSE:")
## [1] "Store 2 RMSE:"
print(sqrt(mean(as.matrix((s2_true[c(1576:1969),] - s2_pred_full[c(1576:1969),])^2))))
## [1] 164.3728
print("Store 3 RMSE:")
## [1] "Store 3 RMSE:"
print(sqrt(mean(as.matrix((s3_true[c(1576:1969),] - s3_pred_full[c(1576:1969),])^2))))
## [1] 176.5701
# Compute RMSSE for each prediction
print("Store 1 RMSSE:")
## [1] "Store 1 RMSSE:"
print(rmsse(train.df[,store_1],s1_pred,test.df[,store_1]))
## [1] 0.1327655
print("Store 2 RMSSE:")
## [1] "Store 2 RMSSE:"
print(rmsse(train.df[,store_2],s2_pred,test.df[,store_2]))
## [1] 0.2205793
print("Store 3 RMSSE:")
## [1] "Store 3 RMSSE:"
print(rmsse(train.df[,store_3],s3_pred,test.df[,store_3]))
## [1] 0.1015475
In this section, we aim to fit VAR models to product data. We thus consider 6 multivariate time series, each corresponding to a product and constituted by the 3 store time series. We compute column indexes of each product in the main dataframe:
product_1 <- c(1,7,13)
product_2 <- c(2,8,14)
product_3 <- c(3,9,15)
product_4 <- c(4,10,16)
product_5 <- c(5,11,17)
product_6 <- c(6,12,18)
p1_df <- train.df[,product_1]
p2_df <- train.df[,product_2]
p3_df <- train.df[,product_3]
p4_df <- train.df[,product_4]
p5_df <- train.df[,product_5]
p6_df <- train.df[,product_6]
Just like in the previous study, we use information criteria to determine our models’ lags.
# Select lag length for each product (maximum lag is selected as the length of a month)
VARselect(p1_df, lag.max=31) # Information criteria mainly suggests p=21
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 21 14 7 21
##
## $criteria
## 1 2 3 4 5
## AIC(n) 2.495240e+01 2.486351e+01 2.483464e+01 2.477337e+01 2.474395e+01
## HQ(n) 2.496784e+01 2.489054e+01 2.487325e+01 2.482356e+01 2.480573e+01
## SC(n) 2.499392e+01 2.493617e+01 2.493844e+01 2.490830e+01 2.491003e+01
## FPE(n) 6.865775e+10 6.281819e+10 6.103043e+10 5.740334e+10 5.573944e+10
## 6 7 8 9 10
## AIC(n) 2.456823e+01 2.438206e+01 2.438054e+01 2.436552e+01 2.436435e+01
## HQ(n) 2.464159e+01 2.446700e+01 2.447707e+01 2.447364e+01 2.448405e+01
## SC(n) 2.476545e+01 2.461041e+01 2.464003e+01 2.465616e+01 2.468612e+01
## FPE(n) 4.675719e+10 3.881460e+10 3.875571e+10 3.817831e+10 3.813375e+10
## 11 12 13 14 15
## AIC(n) 2.436441e+01 2.436158e+01 2.430968e+01 2.424261e+01 2.424033e+01
## HQ(n) 2.449569e+01 2.450445e+01 2.446413e+01 2.440864e+01 2.441795e+01
## SC(n) 2.471732e+01 2.474563e+01 2.472487e+01 2.468894e+01 2.471780e+01
## FPE(n) 3.813615e+10 3.802871e+10 3.610551e+10 3.376373e+10 3.368720e+10
## 16 17 18 19 20
## AIC(n) 2.423106e+01 2.423720e+01 2.424067e+01 2.423621e+01 2.423664e+01
## HQ(n) 2.442026e+01 2.443798e+01 2.445304e+01 2.446016e+01 2.447217e+01
## SC(n) 2.473967e+01 2.477695e+01 2.481156e+01 2.483823e+01 2.486980e+01
## FPE(n) 3.337662e+10 3.358263e+10 3.369997e+10 3.355034e+10 3.356532e+10
## 21 22 23 24 25
## AIC(n) 2.422651e+01 2.423336e+01 2.422893e+01 2.423320e+01 2.423773e+01
## HQ(n) 2.447363e+01 2.449206e+01 2.449922e+01 2.451507e+01 2.453118e+01
## SC(n) 2.489082e+01 2.492880e+01 2.495552e+01 2.499092e+01 2.502659e+01
## FPE(n) 3.322793e+10 3.345676e+10 3.330980e+10 3.345304e+10 3.360593e+10
## 26 27 28 29 30
## AIC(n) 2.424503e+01 2.424773e+01 2.423196e+01 2.423118e+01 2.423900e+01
## HQ(n) 2.455007e+01 2.456435e+01 2.456017e+01 2.457097e+01 2.459037e+01
## SC(n) 2.506504e+01 2.509887e+01 2.511425e+01 2.514461e+01 2.518356e+01
## FPE(n) 3.385327e+10 3.394563e+10 3.341590e+10 3.339110e+10 3.365441e+10
## 31
## AIC(n) 2.424412e+01
## HQ(n) 2.460708e+01
## SC(n) 2.521982e+01
## FPE(n) 3.382861e+10
VARselect(p2_df, lag.max=31) # Information criteria mainly suggests p=28
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 28 14 7 28
##
## $criteria
## 1 2 3 4 5
## AIC(n) 2.225496e+01 2.192836e+01 2.185911e+01 2.179481e+01 2.167138e+01
## HQ(n) 2.227040e+01 2.195539e+01 2.189773e+01 2.184500e+01 2.173316e+01
## SC(n) 2.229648e+01 2.200102e+01 2.196291e+01 2.192974e+01 2.183746e+01
## FPE(n) 4.625990e+09 3.337089e+09 3.113812e+09 2.919880e+09 2.580841e+09
## 6 7 8 9 10
## AIC(n) 2.123194e+01 2.083713e+01 2.084043e+01 2.083290e+01 2.084006e+01
## HQ(n) 2.130530e+01 2.092208e+01 2.093696e+01 2.094102e+01 2.095976e+01
## SC(n) 2.142915e+01 2.106549e+01 2.109992e+01 2.112354e+01 2.116183e+01
## FPE(n) 1.663087e+09 1.120606e+09 1.124309e+09 1.115887e+09 1.123904e+09
## 11 12 13 14 15
## AIC(n) 2.084042e+01 2.084189e+01 2.078165e+01 2.063511e+01 2.063719e+01
## HQ(n) 2.097170e+01 2.098475e+01 2.093610e+01 2.080115e+01 2.081481e+01
## SC(n) 2.119333e+01 2.122594e+01 2.119684e+01 2.108145e+01 2.111466e+01
## FPE(n) 1.124317e+09 1.125975e+09 1.060159e+09 9.156616e+08 9.175770e+08
## 16 17 18 19 20
## AIC(n) 2.064034e+01 2.064508e+01 2.064555e+01 2.065164e+01 2.065360e+01
## HQ(n) 2.082954e+01 2.084586e+01 2.085792e+01 2.087559e+01 2.088913e+01
## SC(n) 2.114895e+01 2.118483e+01 2.121644e+01 2.125367e+01 2.128676e+01
## FPE(n) 9.204801e+08 9.248620e+08 9.253105e+08 9.309762e+08 9.328162e+08
## 21 22 23 24 25
## AIC(n) 2.060617e+01 2.060613e+01 2.060866e+01 2.061128e+01 2.061637e+01
## HQ(n) 2.085329e+01 2.086484e+01 2.087895e+01 2.089316e+01 2.090983e+01
## SC(n) 2.127048e+01 2.130158e+01 2.133525e+01 2.136901e+01 2.140524e+01
## FPE(n) 8.896271e+08 8.896135e+08 8.918881e+08 8.942503e+08 8.988365e+08
## 26 27 28 29 30
## AIC(n) 2.062174e+01 2.062543e+01 2.058738e+01 2.058806e+01 2.058749e+01
## HQ(n) 2.092678e+01 2.094205e+01 2.091559e+01 2.092785e+01 2.093887e+01
## SC(n) 2.144175e+01 2.147657e+01 2.146967e+01 2.150148e+01 2.153205e+01
## FPE(n) 9.037020e+08 9.070672e+08 8.732370e+08 8.738566e+08 8.733989e+08
## 31
## AIC(n) 2.059455e+01
## HQ(n) 2.095751e+01
## SC(n) 2.157025e+01
## FPE(n) 8.796229e+08
VARselect(p3_df, lag.max=31) # Information criteria mainly suggests p=28
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 28 14 7 28
##
## $criteria
## 1 2 3 4 5
## AIC(n) 1.931731e+01 1.905240e+01 1.900274e+01 1.894061e+01 1.888825e+01
## HQ(n) 1.933275e+01 1.907943e+01 1.904135e+01 1.899080e+01 1.895003e+01
## SC(n) 1.935882e+01 1.912506e+01 1.910654e+01 1.907555e+01 1.905433e+01
## FPE(n) 2.451314e+08 1.880836e+08 1.789725e+08 1.681907e+08 1.596119e+08
## 6 7 8 9 10
## AIC(n) 1.849227e+01 1.808515e+01 1.806703e+01 1.805642e+01 1.805968e+01
## HQ(n) 1.856564e+01 1.817010e+01 1.816356e+01 1.816454e+01 1.817938e+01
## SC(n) 1.868949e+01 1.831351e+01 1.832653e+01 1.834706e+01 1.838146e+01
## FPE(n) 1.074220e+08 7.149643e+07 7.021272e+07 6.947206e+07 6.969922e+07
## 11 12 13 14 15
## AIC(n) 1.806263e+01 1.806365e+01 1.804698e+01 1.792961e+01 1.793634e+01
## HQ(n) 1.819391e+01 1.820652e+01 1.820143e+01 1.809564e+01 1.811395e+01
## SC(n) 1.841554e+01 1.844770e+01 1.846217e+01 1.837594e+01 1.841381e+01
## FPE(n) 6.990538e+07 6.997710e+07 6.882061e+07 6.119953e+07 6.161331e+07
## 16 17 18 19 20
## AIC(n) 1.794012e+01 1.794288e+01 1.794903e+01 1.795258e+01 1.795200e+01
## HQ(n) 1.812933e+01 1.814367e+01 1.816140e+01 1.817654e+01 1.818754e+01
## SC(n) 1.844873e+01 1.848263e+01 1.851992e+01 1.855461e+01 1.858517e+01
## FPE(n) 6.184791e+07 6.201939e+07 6.240304e+07 6.262594e+07 6.259058e+07
## 21 22 23 24 25
## AIC(n) 1.793150e+01 1.793228e+01 1.793712e+01 1.794517e+01 1.795160e+01
## HQ(n) 1.817862e+01 1.819098e+01 1.820741e+01 1.822704e+01 1.824506e+01
## SC(n) 1.859580e+01 1.862772e+01 1.866370e+01 1.870289e+01 1.874047e+01
## FPE(n) 6.132134e+07 6.137055e+07 6.166980e+07 6.216991e+07 6.257293e+07
## 26 27 28 29 30
## AIC(n) 1.795394e+01 1.795898e+01 1.791595e+01 1.791679e+01 1.791755e+01
## HQ(n) 1.825897e+01 1.827560e+01 1.824415e+01 1.825658e+01 1.826892e+01
## SC(n) 1.877394e+01 1.881012e+01 1.879823e+01 1.883021e+01 1.886211e+01
## FPE(n) 6.272080e+07 6.303982e+07 6.038679e+07 6.044001e+07 6.048801e+07
## 31
## AIC(n) 1.792285e+01
## HQ(n) 1.828581e+01
## SC(n) 1.889855e+01
## FPE(n) 6.081236e+07
VARselect(p4_df, lag.max=31) # Information criteria mainly suggests p=29
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 29 15 9 29
##
## $criteria
## 1 2 3 4 5
## AIC(n) 1.825532e+01 1.815510e+01 1.811909e+01 1.810333e+01 1.805724e+01
## HQ(n) 1.827077e+01 1.818212e+01 1.815770e+01 1.815353e+01 1.811902e+01
## SC(n) 1.829684e+01 1.822775e+01 1.822289e+01 1.823827e+01 1.822332e+01
## FPE(n) 8.475898e+07 7.667571e+07 7.396407e+07 7.280783e+07 6.952822e+07
## 6 7 8 9 10
## AIC(n) 1.794432e+01 1.785163e+01 1.780322e+01 1.774303e+01 1.773334e+01
## HQ(n) 1.801769e+01 1.793658e+01 1.789975e+01 1.785115e+01 1.785304e+01
## SC(n) 1.814154e+01 1.807999e+01 1.806272e+01 1.803367e+01 1.805512e+01
## FPE(n) 6.210440e+07 5.660658e+07 5.393183e+07 5.078160e+07 5.029213e+07
## 11 12 13 14 15
## AIC(n) 1.772865e+01 1.771493e+01 1.769889e+01 1.766846e+01 1.765494e+01
## HQ(n) 1.785993e+01 1.785779e+01 1.785334e+01 1.783450e+01 1.783256e+01
## SC(n) 1.808156e+01 1.809898e+01 1.811408e+01 1.811480e+01 1.813241e+01
## FPE(n) 5.005677e+07 4.937490e+07 4.858964e+07 4.713410e+07 4.650132e+07
## 16 17 18 19 20
## AIC(n) 1.765028e+01 1.765308e+01 1.765227e+01 1.765613e+01 1.765668e+01
## HQ(n) 1.783948e+01 1.785386e+01 1.786464e+01 1.788008e+01 1.789222e+01
## SC(n) 1.815889e+01 1.819283e+01 1.822316e+01 1.825816e+01 1.828985e+01
## FPE(n) 4.628552e+07 4.641596e+07 4.637925e+07 4.655920e+07 4.658571e+07
## 21 22 23 24 25
## AIC(n) 1.763362e+01 1.763887e+01 1.764295e+01 1.763932e+01 1.762792e+01
## HQ(n) 1.788074e+01 1.789757e+01 1.791324e+01 1.792119e+01 1.792137e+01
## SC(n) 1.829793e+01 1.833431e+01 1.836953e+01 1.839704e+01 1.841678e+01
## FPE(n) 4.552463e+07 4.576494e+07 4.595329e+07 4.578795e+07 4.527000e+07
## 26 27 28 29 30
## AIC(n) 1.761429e+01 1.758550e+01 1.755401e+01 1.755215e+01 1.755661e+01
## HQ(n) 1.791933e+01 1.790212e+01 1.788222e+01 1.789194e+01 1.790798e+01
## SC(n) 1.843429e+01 1.843664e+01 1.843629e+01 1.846557e+01 1.850117e+01
## FPE(n) 4.465859e+07 4.339271e+07 4.204894e+07 4.197214e+07 4.216150e+07
## 31
## AIC(n) 1.756389e+01
## HQ(n) 1.792685e+01
## SC(n) 1.853959e+01
## FPE(n) 4.247159e+07
VARselect(p5_df, lag.max=31) # Information criteria mainly suggests p=23
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 23 14 7 23
##
## $criteria
## 1 2 3 4 5
## AIC(n) 2.427574e+01 2.407679e+01 2.402695e+01 2.397044e+01 2.391158e+01
## HQ(n) 2.429118e+01 2.410382e+01 2.406557e+01 2.402063e+01 2.397336e+01
## SC(n) 2.431725e+01 2.414945e+01 2.413075e+01 2.410537e+01 2.407766e+01
## FPE(n) 3.489939e+10 2.860341e+10 2.721283e+10 2.571747e+10 2.424763e+10
## 6 7 8 9 10
## AIC(n) 2.370184e+01 2.347248e+01 2.345780e+01 2.344144e+01 2.344406e+01
## HQ(n) 2.377520e+01 2.355743e+01 2.355433e+01 2.354956e+01 2.356376e+01
## SC(n) 2.389905e+01 2.370084e+01 2.371729e+01 2.373208e+01 2.376584e+01
## FPE(n) 1.965981e+10 1.563050e+10 1.540268e+10 1.515284e+10 1.519266e+10
## 11 12 13 14 15
## AIC(n) 2.344886e+01 2.345837e+01 2.341680e+01 2.336675e+01 2.336155e+01
## HQ(n) 2.358014e+01 2.360124e+01 2.357125e+01 2.353279e+01 2.353916e+01
## SC(n) 2.380177e+01 2.384242e+01 2.383199e+01 2.381309e+01 2.383902e+01
## FPE(n) 1.526573e+10 1.541177e+10 1.478434e+10 1.406279e+10 1.398986e+10
## 16 17 18 19 20
## AIC(n) 2.335990e+01 2.337009e+01 2.337454e+01 2.337281e+01 2.336653e+01
## HQ(n) 2.354911e+01 2.357087e+01 2.358691e+01 2.359677e+01 2.360207e+01
## SC(n) 2.386851e+01 2.390984e+01 2.394543e+01 2.397484e+01 2.399970e+01
## FPE(n) 1.396706e+10 1.411020e+10 1.417339e+10 1.414914e+10 1.406082e+10
## 21 22 23 24 25
## AIC(n) 2.333481e+01 2.333836e+01 2.333397e+01 2.334415e+01 2.335273e+01
## HQ(n) 2.358193e+01 2.359707e+01 2.360426e+01 2.362602e+01 2.364618e+01
## SC(n) 2.399911e+01 2.403381e+01 2.406056e+01 2.410187e+01 2.414159e+01
## FPE(n) 1.362199e+10 1.367080e+10 1.361121e+10 1.375075e+10 1.386961e+10
## 26 27 28 29 30
## AIC(n) 2.335618e+01 2.335876e+01 2.335001e+01 2.335485e+01 2.336052e+01
## HQ(n) 2.366122e+01 2.367539e+01 2.367822e+01 2.369464e+01 2.371189e+01
## SC(n) 2.417619e+01 2.420991e+01 2.423230e+01 2.426827e+01 2.430508e+01
## FPE(n) 1.391805e+10 1.395443e+10 1.383336e+10 1.390097e+10 1.398050e+10
## 31
## AIC(n) 2.336517e+01
## HQ(n) 2.372813e+01
## SC(n) 2.434087e+01
## FPE(n) 1.404624e+10
VARselect(p6_df, lag.max=31) # Information criteria mainly suggests p=29
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 29 14 7 29
##
## $criteria
## 1 2 3 4 5
## AIC(n) 3.160505e+01 3.125589e+01 3.117194e+01 3.109239e+01 3.101470e+01
## HQ(n) 3.162050e+01 3.128292e+01 3.121055e+01 3.114258e+01 3.107648e+01
## SC(n) 3.164657e+01 3.132855e+01 3.127574e+01 3.122732e+01 3.118077e+01
## FPE(n) 5.319859e+13 3.751973e+13 3.449865e+13 3.186044e+13 2.947904e+13
## 6 7 8 9 10
## AIC(n) 3.071029e+01 3.038436e+01 3.037828e+01 3.035686e+01 3.036010e+01
## HQ(n) 3.078366e+01 3.046930e+01 3.047481e+01 3.046498e+01 3.047980e+01
## SC(n) 3.090751e+01 3.061271e+01 3.063777e+01 3.064750e+01 3.068187e+01
## FPE(n) 2.174262e+13 1.569498e+13 1.559990e+13 1.526944e+13 1.531901e+13
## 11 12 13 14 15
## AIC(n) 3.036370e+01 3.036296e+01 3.032475e+01 3.022709e+01 3.022268e+01
## HQ(n) 3.049498e+01 3.050583e+01 3.047920e+01 3.039312e+01 3.040030e+01
## SC(n) 3.071661e+01 3.074702e+01 3.073994e+01 3.067342e+01 3.070015e+01
## FPE(n) 1.537436e+13 1.536311e+13 1.478727e+13 1.341148e+13 1.335264e+13
## 16 17 18 19 20
## AIC(n) 3.022698e+01 3.023316e+01 3.024314e+01 3.024557e+01 3.023443e+01
## HQ(n) 3.041618e+01 3.043394e+01 3.045551e+01 3.046952e+01 3.046997e+01
## SC(n) 3.073559e+01 3.077291e+01 3.081403e+01 3.084760e+01 3.086760e+01
## FPE(n) 1.341034e+13 1.349357e+13 1.362919e+13 1.366252e+13 1.351137e+13
## 21 22 23 24 25
## AIC(n) 3.019457e+01 3.018809e+01 3.018524e+01 3.018104e+01 3.018609e+01
## HQ(n) 3.044169e+01 3.044679e+01 3.045553e+01 3.046292e+01 3.047954e+01
## SC(n) 3.085887e+01 3.088353e+01 3.091183e+01 3.093877e+01 3.097495e+01
## FPE(n) 1.298363e+13 1.290002e+13 1.286368e+13 1.281012e+13 1.287523e+13
## 26 27 28 29 30
## AIC(n) 3.016748e+01 3.015791e+01 3.012075e+01 3.011703e+01 3.011838e+01
## HQ(n) 3.047252e+01 3.047454e+01 3.044896e+01 3.045682e+01 3.046975e+01
## SC(n) 3.098748e+01 3.100906e+01 3.100303e+01 3.103045e+01 3.106294e+01
## FPE(n) 1.263820e+13 1.251832e+13 1.206203e+13 1.201773e+13 1.203436e+13
## 31
## AIC(n) 3.011953e+01
## HQ(n) 3.048249e+01
## SC(n) 3.109523e+01
## FPE(n) 1.204873e+13
This analysis isn’t as consistent as it was for the stores in Question 1. At least, suggested lags are less interpretable. We decide do use the suggested lag at first and will decide to move to p=7 in case of poor results. We then fit our VAR models on test data just like in the previous question.
p1_pred <- data.frame()
p2_pred <- data.frame()
p3_pred <- data.frame()
p4_pred <- data.frame()
p5_pred <- data.frame()
p6_pred <- data.frame()
for (i in 1:394){
p1_df <- bind_rows(train.df[,product_1], test.df[c(1:i-1),product_1])
p2_df <- bind_rows(train.df[,product_2], test.df[c(1:i-1),product_2])
p3_df <- bind_rows(train.df[,product_3], test.df[c(1:i-1),product_3])
p4_df <- bind_rows(train.df[,product_4], test.df[c(1:i-1),product_4])
p5_df <- bind_rows(train.df[,product_5], test.df[c(1:i-1),product_5])
p6_df <- bind_rows(train.df[,product_6], test.df[c(1:i-1),product_6])
var.p1 <- VAR(p1_df, p=21)
var.p2 <- VAR(p2_df, p=28)
var.p3 <- VAR(p3_df, p=28)
var.p4 <- VAR(p4_df, p=29)
var.p5 <- VAR(p5_df, p=23)
var.p6 <- VAR(p6_df, p=29)
var.p1.pred <- predict(var.p1, n.ahead=1)
var.p2.pred <- predict(var.p2, n.ahead=1)
var.p3.pred <- predict(var.p3, n.ahead=1)
var.p4.pred <- predict(var.p4, n.ahead=1)
var.p5.pred <- predict(var.p5, n.ahead=1)
var.p6.pred <- predict(var.p6, n.ahead=1)
p1_pred <- bind_rows(p1_pred,data.frame(t(var.p1.pred$model$y[nrow(var.p1.pred$model$y),])))
p2_pred <- bind_rows(p2_pred,data.frame(t(var.p2.pred$model$y[nrow(var.p2.pred$model$y),])))
p3_pred <- bind_rows(p3_pred,data.frame(t(var.p3.pred$model$y[nrow(var.p3.pred$model$y),])))
p4_pred <- bind_rows(p4_pred,data.frame(t(var.p4.pred$model$y[nrow(var.p4.pred$model$y),])))
p5_pred <- bind_rows(p5_pred,data.frame(t(var.p5.pred$model$y[nrow(var.p5.pred$model$y),])))
p6_pred <- bind_rows(p6_pred,data.frame(t(var.p6.pred$model$y[nrow(var.p6.pred$model$y),])))
}
We can then compute complete ground truth and predicted dataframes for each product.
p1_true <- ts_df[,product_1]
p1_pred_full <- bind_rows(train.df[,product_1],p1_pred)
p2_true <- ts_df[,product_2]
p2_pred_full <- bind_rows(train.df[,product_2],p2_pred)
p3_true <- ts_df[,product_3]
p3_pred_full <- bind_rows(train.df[,product_3],p3_pred)
p4_true <- ts_df[,product_4]
p4_pred_full <- bind_rows(train.df[,product_4],p4_pred)
p5_true <- ts_df[,product_5]
p5_pred_full <- bind_rows(train.df[,product_5],p5_pred)
p6_true <- ts_df[,product_6]
p6_pred_full <- bind_rows(train.df[,product_6],p6_pred)
Just like for store predictions, the following plots show the great precision of our models’ predictions.
# Plot prediction and ground truth curves
time <- 1:394
# Product 1 plot
p1_s1 <- ggplot() + geom_line(data=p1_pred, aes(x=time,y=Hobbies_CA_1), colour="red") +
geom_line(data=test.df[,product_1], aes(x=time,y=Hobbies_CA_1)) + labs(y="Store 1")
p1_s2 <- ggplot() + geom_line(data=p1_pred, aes(x=time,y=Hobbies_CA_2), colour="red") +
geom_line(data=test.df[,product_1], aes(x=time,y=Hobbies_CA_2)) + labs(y="Store 2")
p1_s3 <- ggplot() + geom_line(data=p1_pred, aes(x=time,y=Hobbies_CA_3), colour="red") +
geom_line(data=test.df[,product_1], aes(x=time,y=Hobbies_CA_3)) + labs(y="Store 3")
p1_fig <- ggarrange(p1_s1, p1_s2, p1_s3,
ncol=1, nrow=3)
annotate_figure(p1_fig,
top = text_grob("Hobbies Product Forecast", face = "bold", size = 14)
)
# Product 2 plot
p2_s1 <- ggplot() + geom_line(data=p2_pred, aes(x=time,y=Household_1_CA_1), colour="red") +
geom_line(data=test.df[,product_2], aes(x=time,y=Household_1_CA_1)) + labs(y="Store 1")
p2_s2 <- ggplot() + geom_line(data=p2_pred, aes(x=time,y=Household_1_CA_2), colour="red") +
geom_line(data=test.df[,product_2], aes(x=time,y=Household_1_CA_2)) + labs(y="Store 2")
p2_s3 <- ggplot() + geom_line(data=p2_pred, aes(x=time,y=Household_1_CA_3), colour="red") +
geom_line(data=test.df[,product_2], aes(x=time,y=Household_1_CA_3)) + labs(y="Store 3")
p2_fig <- ggarrange(p2_s1, p2_s2, p2_s3,
ncol=1, nrow=3)
annotate_figure(p2_fig,
top = text_grob("Household 1 Product Forecast", face = "bold", size = 14)
)
# Product 3 plot
p3_s1 <- ggplot() + geom_line(data=p3_pred, aes(x=time,y=Household_2_CA_1), colour="red") +
geom_line(data=test.df[,product_3], aes(x=time,y=Household_2_CA_1)) + labs(y="Store 1")
p3_s2 <- ggplot() + geom_line(data=p3_pred, aes(x=time,y=Household_2_CA_2), colour="red") +
geom_line(data=test.df[,product_3], aes(x=time,y=Household_2_CA_2)) + labs(y="Store 2")
p3_s3 <- ggplot() + geom_line(data=p3_pred, aes(x=time,y=Household_2_CA_3), colour="red") +
geom_line(data=test.df[,product_3], aes(x=time,y=Household_2_CA_3)) + labs(y="Store 3")
p3_fig <- ggarrange(p3_s1, p3_s2, p3_s3,
ncol=1, nrow=3)
annotate_figure(p3_fig,
top = text_grob("Household 2 Product Forecast", face = "bold", size = 14)
)
# Product 4 plot
p4_s1 <- ggplot() + geom_line(data=p4_pred, aes(x=time,y=Foods_1_CA_1), colour="red") +
geom_line(data=test.df[,product_4], aes(x=time,y=Foods_1_CA_1)) + labs(y="Store 1")
p4_s2 <- ggplot() + geom_line(data=p4_pred, aes(x=time,y=Foods_1_CA_2), colour="red") +
geom_line(data=test.df[,product_4], aes(x=time,y=Foods_1_CA_2)) + labs(y="Store 2")
p4_s3 <- ggplot() + geom_line(data=p4_pred, aes(x=time,y=Foods_1_CA_3), colour="red") +
geom_line(data=test.df[,product_4], aes(x=time,y=Foods_1_CA_3)) + labs(y="Store 3")
p4_fig <- ggarrange(p4_s1, p4_s2, p4_s3,
ncol=1, nrow=3)
annotate_figure(p4_fig,
top = text_grob("Foods 1 Product Forecast", face = "bold", size = 14)
)
# Product 5 plot
p5_s1 <- ggplot() + geom_line(data=p5_pred, aes(x=time,y=Foods_2_CA_1), colour="red") +
geom_line(data=test.df[,product_5], aes(x=time,y=Foods_2_CA_1)) + labs(y="Store 1")
p5_s2 <- ggplot() + geom_line(data=p5_pred, aes(x=time,y=Foods_2_CA_2), colour="red") +
geom_line(data=test.df[,product_5], aes(x=time,y=Foods_2_CA_2)) + labs(y="Store 2")
p5_s3 <- ggplot() + geom_line(data=p5_pred, aes(x=time,y=Foods_2_CA_3), colour="red") +
geom_line(data=test.df[,product_5], aes(x=time,y=Foods_2_CA_3)) + labs(y="Store 3")
p5_fig <- ggarrange(p5_s1, p5_s2, p5_s3,
ncol=1, nrow=3)
annotate_figure(p5_fig,
top = text_grob("Foods 2 Product Forecast", face = "bold", size = 14)
)
# Product 6 plot
p6_s1 <- ggplot() + geom_line(data=p6_pred, aes(x=time,y=Foods_3_CA_1), colour="red") +
geom_line(data=test.df[,product_6], aes(x=time,y=Foods_3_CA_1)) + labs(y="Store 1")
p6_s2 <- ggplot() + geom_line(data=p6_pred, aes(x=time,y=Foods_3_CA_2), colour="red") +
geom_line(data=test.df[,product_6], aes(x=time,y=Foods_3_CA_2)) + labs(y="Store 2")
p6_s3 <- ggplot() + geom_line(data=p6_pred, aes(x=time,y=Foods_3_CA_3), colour="red") +
geom_line(data=test.df[,product_6], aes(x=time,y=Foods_3_CA_3)) + labs(y="Store 3")
p6_fig <- ggarrange(p6_s1, p6_s2, p6_s3,
ncol=1, nrow=3)
annotate_figure(p6_fig,
top = text_grob("Foods 3 Product Forecast", face = "bold", size = 14)
)
Finally, we compute RMSE and RMSSE scores for each model, that also compare very well to the models computed in Exercise 1 and motivate the use of multivariate models in this case.
# Compute RMSE for each prediction
print("Product 1 RMSE:")
## [1] "Product 1 RMSE:"
print(sqrt(mean(as.matrix((p1_true[c(1576:1969),] - p1_pred_full[c(1576:1969),])^2))))
## [1] 102.9941
print("Product 2 RMSE:")
## [1] "Product 2 RMSE:"
print(sqrt(mean(as.matrix((p2_true[c(1576:1969),] - p2_pred_full[c(1576:1969),])^2))))
## [1] 82.00462
print("Product 3 RMSE:")
## [1] "Product 3 RMSE:"
print(sqrt(mean(as.matrix((p3_true[c(1576:1969),] - p3_pred_full[c(1576:1969),])^2))))
## [1] 45.58218
print("Product 4 RMSE:")
## [1] "Product 4 RMSE:"
print(sqrt(mean(as.matrix((p4_true[c(1576:1969),] - p4_pred_full[c(1576:1969),])^2))))
## [1] 48.07086
print("Product 5 RMSE:")
## [1] "Product 5 RMSE:"
print(sqrt(mean(as.matrix((p5_true[c(1576:1969),] - p5_pred_full[c(1576:1969),])^2))))
## [1] 77.88185
print("Product 6 RMSE:")
## [1] "Product 6 RMSE:"
print(sqrt(mean(as.matrix((p6_true[c(1576:1969),] - p6_pred_full[c(1576:1969),])^2))))
## [1] 378.0069
# Compute RMSSE for each prediction
print("Product 1 RMSSE:")
## [1] "Product 1 RMSSE:"
print(rmsse(train.df[,product_1],p1_pred,test.df[,product_1]))
## [1] 0.3858423
print("Product 2 RMSSE:")
## [1] "Product 2 RMSSE:"
print(rmsse(train.df[,product_2],p2_pred,test.df[,product_2]))
## [1] 0.3193946
print("Product 3 RMSSE:")
## [1] "Product 3 RMSSE:"
print(rmsse(train.df[,product_3],p3_pred,test.df[,product_3]))
## [1] 0.488002
print("Product 4 RMSSE:")
## [1] "Product 4 RMSSE:"
print(rmsse(train.df[,product_4],p4_pred,test.df[,product_4]))
## [1] 0.1749373
print("Product 5 RMSSE:")
## [1] "Product 5 RMSSE:"
print(rmsse(train.df[,product_5],p5_pred,test.df[,product_5]))
## [1] 0.150227
print("Product 6 RMSSE:")
## [1] "Product 6 RMSSE:"
print(rmsse(train.df[,product_6],p6_pred,test.df[,product_6]))
## [1] 0.1663714
For this final part of VAR modeling, we decide to use the whole dataset as a multivariate time series, accross products and stores. This can be justified by the apparent dependencies of each time series to others. Therefore, splitting the dataset in any way would remove valuable information to predict the evolution of each series. We follow the same process as in the previous two questions. First, we select our model’s lag:
VARselect(train.df, lag.max=31) # Information criteria mainly suggests p=7
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 7 2 1 7
##
## $criteria
## 1 2 3 4 5
## AIC(n) 1.323697e+02 1.316308e+02 1.314288e+02 1.311996e+02 1.310775e+02
## HQ(n) 1.328098e+02 1.324880e+02 1.327030e+02 1.328908e+02 1.331857e+02
## SC(n) 1.335530e+02 1.339351e+02 1.348542e+02 1.357459e+02 1.367448e+02
## FPE(n) 3.072009e+57 1.467573e+57 1.199643e+57 9.545569e+56 8.458530e+56
## 6 7 8 9 10
## AIC(n) 1.307290e+02 1.303707e+02 1.304920e+02 1.306052e+02 1.307388e+02
## HQ(n) 1.332543e+02 1.333130e+02 1.338513e+02 1.343815e+02 1.349321e+02
## SC(n) 1.375174e+02 1.382801e+02 1.395224e+02 1.407566e+02 1.420112e+02
## FPE(n) 5.980288e+56 4.189671e+56 4.745722e+56 5.336907e+56 6.131804e+56
## 11 12 13 14 15
## AIC(n) 1.309070e+02 1.310221e+02 1.311041e+02 1.310831e+02 1.311931e+02
## HQ(n) 1.355174e+02 1.360494e+02 1.365485e+02 1.369445e+02 1.374715e+02
## SC(n) 1.433005e+02 1.445366e+02 1.457396e+02 1.468396e+02 1.480707e+02
## FPE(n) 7.302640e+56 8.256631e+56 9.045638e+56 8.953067e+56 1.011962e+57
## 16 17 18 19 20
## AIC(n) 1.313550e+02 1.315135e+02 1.316831e+02 1.318389e+02 1.319580e+02
## HQ(n) 1.380504e+02 1.386260e+02 1.392126e+02 1.397854e+02 1.403215e+02
## SC(n) 1.493536e+02 1.506331e+02 1.519237e+02 1.532005e+02 1.544407e+02
## FPE(n) 1.206914e+57 1.437366e+57 1.734575e+57 2.069160e+57 2.385269e+57
## 21 22 23 24 25
## AIC(n) 1.320267e+02 1.321305e+02 1.322346e+02 1.323216e+02 1.324288e+02
## HQ(n) 1.408072e+02 1.413280e+02 1.418491e+02 1.423532e+02 1.428774e+02
## SC(n) 1.556303e+02 1.568552e+02 1.580803e+02 1.592883e+02 1.605165e+02
## FPE(n) 2.620969e+57 2.991072e+57 3.424441e+57 3.865872e+57 4.467632e+57
## 26 27 28 29 30
## AIC(n) 1.324961e+02 1.325581e+02 1.325991e+02 1.326406e+02 1.327231e+02
## HQ(n) 1.433616e+02 1.438407e+02 1.442987e+02 1.447572e+02 1.452567e+02
## SC(n) 1.617048e+02 1.628878e+02 1.640499e+02 1.652124e+02 1.664159e+02
## FPE(n) 4.977796e+57 5.537191e+57 6.054718e+57 6.650105e+57 7.641914e+57
## 31
## AIC(n) 1.328379e+02
## HQ(n) 1.457885e+02
## SC(n) 1.676517e+02
## FPE(n) 9.110632e+57
Just like in Question 1, information criteria suggest the use of p=7 as lag. This feels quite logical as the smallest seasonality of our signal is of a week’s length. We then fit our models using the following training loop:
full_pred <- data.frame()
for (i in 1:394){
full_df <- bind_rows(train.df, test.df[c(1:i-1),])
var.full <- VAR(full_df, p=7)
var.full.pred <- predict(var.full, n.ahead=1)
full_pred <- bind_rows(full_pred,data.frame(t(var.full.pred$model$y[nrow(var.full.pred$model$y),])))
}
We then compute our ground truth and predictions time series and visualize them:
# Get final ground truth and prediction dataframes
full_true <- ts_df[,]
full_pred_full <- bind_rows(train.df,full_pred)
# Full Model Plot
full_11 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Hobbies_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Hobbies_CA_1)) + labs(y="Hobbies")
full_12 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Hobbies_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Hobbies_CA_2)) + labs(y="Hobbies")
full_13 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Hobbies_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Hobbies_CA_3)) + labs(y="Hobbies")
full_21 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_1_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_1_CA_1)) + labs(y="Household 1")
full_22 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_1_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_1_CA_2)) + labs(y="Household 1")
full_23 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_1_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_1_CA_3)) + labs(y="Household 1")
full_31 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_2_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_2_CA_1)) + labs(y="Household 2")
full_32 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_2_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_2_CA_2)) + labs(y="Household 2")
full_33 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Household_2_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Household_2_CA_3)) + labs(y="Household 2")
full_41 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_1_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_1_CA_1)) + labs(y="Foods 1")
full_42 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_1_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_1_CA_2)) + labs(y="Foods 1")
full_43 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_1_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_1_CA_3)) + labs(y="Foods 1")
full_51 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_2_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_2_CA_1)) + labs(y="Foods 2")
full_52 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_2_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_2_CA_2)) + labs(y="Foods 2")
full_53 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_2_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_2_CA_3)) + labs(y="Foods 2")
full_61 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_3_CA_1), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_3_CA_1)) + labs(y="Foods 3")
full_62 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_3_CA_2), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_3_CA_2)) + labs(y="Foods 3")
full_63 <- ggplot() + geom_line(data=full_pred, aes(x=time,y=Foods_3_CA_3), colour="red") +
geom_line(data=test.df, aes(x=time,y=Foods_3_CA_3)) + labs(y="Foods 3")
# We make several figures for visual reasons
# Store 1
full_fig_s1 <- ggarrange(full_11, full_21, full_31,
full_41, full_51, full_61,
ncol=2, nrow=3)
annotate_figure(full_fig_s1,
top = text_grob("Full Model Forecast, Store 1", face = "bold", size = 14)
)
# Store 2
full_fig_s2 <- ggarrange(full_12, full_22, full_32,
full_42, full_52, full_62,
ncol=2, nrow=3)
annotate_figure(full_fig_s2,
top = text_grob("Full Model Forecast, Store 2", face = "bold", size = 14)
)
# Store 3
full_fig_s3 <- ggarrange(full_13, full_23, full_33,
full_43, full_53, full_63,
ncol=2, nrow=3)
annotate_figure(full_fig_s3,
top = text_grob("Full Model Forecast, Store 3", face = "bold", size = 14)
)
Finally, we’re able to compute our model’s RMSE and RMSSE scores:
# Compute RMSE for each prediction
print("Full Model RMSE:")
## [1] "Full Model RMSE:"
print(sqrt(mean(as.matrix((full_true[c(1576:1969),] - full_pred_full[c(1576:1969),])^2))))
## [1] 168.6594
# Compute RMSSE for each prediction
print("Full Model RMSSE:")
## [1] "Full Model RMSSE:"
print(rmsse(train.df[,],full_pred,test.df[,]))
## [1] 0.1165235
Though our final model, containing all time series, is more precise with regards to RMSSE score, it took much longer to compute. It also feels impossible to scale in the case of larger store distributions with bigger product catalogs. Indeed, the number of variables of this time series grows as a factor of number of shops*number of products. This becomes completely infeasible in real case scenarios. The first two models offer worst results, though still satisfying, and cause less issues with regards to the statement made above. Moreover, the model computed per store appears to be the most satisfying. Indeed, curves show that the evolution of sales are more correlated with regards to the store than the product itself. This feels logical as sales of each product within a store intuitively vary with regards to the number of employees in a store whereas the sale of a same product accross stores may be less simply explainable. Therefore, we find great satisfaction in the models fitted within stores which offer the best compromise between scaling stakes and results.